This section is dedicated to understanding the data. We
provide a visual analysis of the data set in order to
summarize its main characteristics.
Let's
import the cleaned data set that we created.
Mountain_data_cleaned <- read.csv("../data/Mountain_data_cleaned.csv")
We will analyze some basic statistical elements for each variable. To do this, we first need to convert the variable Date to the Date format.
Mountain_data_cleaned$Date <- as.Date(Mountain_data_cleaned$Date)
We first look at the general shape of the data set and check
whether there are any missing values.
| rows | columns | all_missing_columns | total_missing_values | complete_rows | total_observations |
|---|---|---|---|---|---|
| 430 | 34 | 0 | 2 | 428 | 14620 |
There are 2 missing values, both in the feature
Glu_P: one at instance 377 and the other
at instance 378.
| | Country | Mountain_range | Locality | Plot | Subplot | Date | Glu_P |
|---|---|---|---|---|---|---|---|
| 377 | Chile | Central Andes | Baños de Colinas | 76 | 2 | 2014-01-21 | NA |
| 378 | Chile | Central Andes | Baños de Colinas | 76 | 3 | 2014-01-21 | NA |
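Locating such incomplete instances can be done directly in R. The sketch below uses a toy data frame mimicking the Glu_P variable; the real check would of course run on Mountain_data_cleaned:

```r
# Sketch (not the report's actual code): locating rows with missing
# values in a given column, on a toy data frame mimicking Glu_P
df <- data.frame(
  Plot  = c(75, 76, 76, 77),
  Glu_P = c(2.10, NA, NA, 1.37)
)

# Indices of the instances where Glu_P is missing
na_rows <- which(is.na(df$Glu_P))
na_rows

# df[na_rows, ] then displays the full incomplete instances
```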
To understand the composition of the data set we
use the following graph:
All the variables that we will use to train our models come from
columns of continuous values, which explains their
predominance. We can also note that missing observations
represent only 0.014% of the total number of observations, which at
first sight makes this a good data set.
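The 0.014% figure can be verified directly from the counts in the summary table:

```r
# Share of missing values among all cells of the data set
# (counts taken from the summary table above)
total_missing      <- 2
total_observations <- 430 * 34          # rows x columns = 14620 cells
round(100 * total_missing / total_observations, 3)  # about 0.014 percent
```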
We take our search
for anomalies further by exploring the characteristics of each
variable. For the readability of the report, we show only a few
variables.
## Country
## n missing distinct
## 430 0 2
##
## Value Chile Spain
## Frequency 100 330
## Proportion 0.233 0.767
## Mountain_range
## n missing distinct
## 430 0 3
##
## Value Central Andes Central Pyrenees Sierra de Guadarrama
## Frequency 100 135 195
## Proportion 0.233 0.314 0.453
## Phos_P
## n missing distinct Info Mean Gmd .05 .10
## 430 0 268 1 3.477 2.26 0.6777 1.0753
## .25 .50 .75 .90 .95
## 1.9318 3.1503 4.7146 6.2213 7.1411
##
## lowest : 0.01980997 0.16225689 0.25041658 0.31268048 0.32430518
## highest: 7.73917664 7.90167428 8.05414937 8.64896403 8.64973041
## Glu_P
## n missing distinct Info Mean Gmd .05 .10
## 428 2 272 1 2.102 1.24 0.2986 0.5062
## .25 .50 .75 .90 .95
## 1.3679 2.0922 2.7710 3.2561 3.9541
##
## lowest : 0.1074305 0.1110000 0.1140850 0.1150734 0.1619782
## highest: 4.5538748 4.7529319 5.2224173 5.5441612 6.3505287
## NT_P
## n missing distinct Info Mean Gmd .05 .10
## 430 0 273 1 3.971 2.688 0.4501 0.7760
## .25 .50 .75 .90 .95
## 2.1575 3.9155 5.6767 7.2597 8.0787
##
## lowest : 0.1156901 0.1610857 0.1833210 0.2201596 0.2365415
## highest: 8.4518587 8.9110000 9.0285000 10.8160000 18.0010000
We can observe a large difference between the total
number of observations and the number of distinct values for the
variables related to the chemical elements. We will try to understand
where this difference comes from in the visual analysis part.
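The gap between n and distinct in the output above (which presumably comes from a summary function such as Hmisc::describe) can be quantified per variable; a minimal base-R sketch on a toy vector:

```r
# Sketch: comparing the number of observations with the number of
# distinct values, as in the summary output above (e.g. Phos_P: 430 vs 268)
x <- c(3.15, 4.71, 3.15, 1.93, 4.71, 4.71)  # toy measurements with repeats

n_obs      <- length(x)
n_distinct <- length(unique(x))
n_obs - n_distinct  # number of repeated (non-distinct) values
```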
We plot the numerical variables.
Many variables show a distribution that resembles a
log-normal distribution.
We then use box plots to detect outliers in the
numerical variables and to compare the distributions across the mountain classes.
We can see clear differences between the mountains.
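A sketch of such a box plot by mountain class, on simulated log-normal values (illustrative only; the real plot uses the cleaned data set):

```r
# Sketch: box plots of a numeric variable split by mountain range,
# using simulated log-normal values instead of the real columns
set.seed(1)
vals     <- c(rlnorm(50, 1.0), rlnorm(50, 0.5), rlnorm(50, 1.5))
mountain <- rep(c("Central Andes", "Central Pyrenees", "Sierra de Guadarrama"),
                each = 50)

# plot = FALSE returns the box statistics instead of drawing,
# handy for inspecting medians and whiskers programmatically
bp <- boxplot(vals ~ mountain, plot = FALSE)
bp$stats  # one column of five-number summaries per mountain range
```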
We then plot the categorical variables:
We see that more observations come from Spain, which is expected
since two of the three mountain ranges are located in Spain. The localities
where the samples were taken almost all consist of a sample of 5
subsamples. Some localities, perhaps more interesting for the study,
were sampled several times, but always in multiples of 5 subsamples.
We have more observations for the mountain range "Sierra de
Guadarrama" (195) than for "Central Andes" (100) and "Central
Pyrenees" (135). These differences in the number of observations are
large enough that we should be careful with the results and consider
balancing the data if needed.
Because of this imbalance, accuracy alone may be misleading, and we may
need to focus more on sensitivity and specificity, since there is
twice as much information on Sierra de Guadarrama.
As described above in the summary of the data, we have
more information on the mountain range "Sierra de Guadarrama":
twice as much as on the mountain range "Central Andes". Our
final result might be affected in a bad way, because the model will tend
to produce a good accuracy (by predicting "Sierra de
Guadarrama" more often) while not being good enough to predict a new
instance.
We will have to see whether we need to balance
our data to get a better model.
We will also inspect
possible duplicate observations: as previously found, some
variables do not have entirely distinct observations.
We immediately notice how few of the Sierra de Guadarrama
samples are distinct: removing duplicates leaves
us with a data set of only 274 observations. We will
therefore first implement our models while keeping the
duplicates, knowing that identical values in the train set and the test
set will inflate the measured accuracy of the model. Then we will test
our models again with the reduced data set to observe whether there is a loss
of accuracy.
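A base-R sketch of the de-duplication step (the report's actual filtering function is not shown here and may differ, e.g. dplyr::distinct):

```r
# Sketch: detecting and removing duplicated observations with base R,
# on a toy data frame with repeated rows
df <- data.frame(
  Mountain = c("Guadarrama", "Guadarrama", "Andes", "Guadarrama"),
  Phos_P   = c(3.15, 3.15, 4.71, 3.15)
)

dup_rows <- duplicated(df)    # TRUE for every repeat of an earlier row
sum(dup_rows)                 # how many observations would be dropped

df_unique <- df[!dup_rows, ]  # de-duplicated data set
nrow(df_unique)
```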
For the rest of the EDA we will continue the
analysis on the complete data set.
From the correlation plot it seems that some patterns can be
observed. The variables concerning the Phosphatase
enzyme seem to be positively correlated with the variables
about Soil organic carbon.
With this plot, we indeed see that the families of Soil
organic carbon and Phosphatase enzyme are
significantly positively correlated, with correlation coefficients ranging
from 0.739 (SOC_B - Phos_P) to 0.947 (SOC_T - SOC_P).
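As an illustration of how such coefficients are obtained, a sketch on simulated data, where SOC_T and Phos_P stand in for the real columns:

```r
# Sketch: Pearson correlation between two variables, on simulated data
# (SOC_T and Phos_P here are stand-ins for the real columns)
set.seed(42)
SOC_T  <- rlnorm(100)
Phos_P <- 0.8 * SOC_T + rnorm(100, sd = 0.3)  # construct a positive association

cor(SOC_T, Phos_P)         # Pearson correlation coefficient
cor(cbind(SOC_T, Phos_P))  # the full 2x2 correlation matrix
```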
The first step is to analyse the data in the covariance matrix,
as we did before, where we found the positive correlation between
the Soil organic carbon and Phosphatase
enzyme.
The second step is to group the data into
Principal Components.
The third step is to produce a variable
factor map to better understand the role of each factor.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3080 2.2090 1.6387 1.26851 1.02032 0.97230 0.82856
## Proportion of Variance 0.4377 0.1952 0.1074 0.06436 0.04164 0.03781 0.02746
## Cumulative Proportion 0.4377 0.6329 0.7403 0.80469 0.84633 0.88414 0.91161
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.7106 0.63948 0.58236 0.5050 0.41430 0.37199 0.32284
## Proportion of Variance 0.0202 0.01636 0.01357 0.0102 0.00687 0.00554 0.00417
## Cumulative Proportion 0.9318 0.94816 0.96173 0.9719 0.97879 0.98433 0.98850
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.25571 0.23393 0.21525 0.17576 0.17150 0.1415 0.12739
## Proportion of Variance 0.00262 0.00219 0.00185 0.00124 0.00118 0.0008 0.00065
## Cumulative Proportion 0.99111 0.99330 0.99515 0.99639 0.99757 0.9984 0.99902
## PC22 PC23 PC24 PC25
## Standard deviation 0.1115 0.08012 0.05935 0.04711
## Proportion of Variance 0.0005 0.00026 0.00014 0.00009
## Cumulative Proportion 0.9995 0.99977 0.99991 1.00000
Here, since the command prcomp does not allow NAs in the data, we use the command na.omit on our reduced numerical data to drop all incomplete cases from the data frame.
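A minimal sketch of this NA handling, on a toy data frame rather than the actual data:

```r
# Sketch: prcomp() fails on missing values, so incomplete rows
# are dropped first with na.omit()
m <- data.frame(
  a = c(1.2, 2.3, NA, 4.1, 5.0),
  b = c(0.5, 1.1, 2.0, NA, 3.3),
  c = c(2.2, 2.9, 3.5, 4.4, 5.1)
)

m_complete <- na.omit(m)  # keeps only the 3 complete rows
pca <- prcomp(m_complete, scale. = TRUE)
summary(pca)              # standard deviations and variance explained
```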
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 430 individuals, described by 25 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
For further analysis, we can also study the eigenvalues in
order to select a suitable number of components.
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 10.936952114 43.747808455 43.74781
## Dim.2 4.863450112 19.453800446 63.20161
## Dim.3 2.694400932 10.777603728 73.97921
## Dim.4 1.602728782 6.410915127 80.39013
## Dim.5 1.045432762 4.181731049 84.57186
## Dim.6 0.954822102 3.819288408 88.39115
## Dim.7 0.688518236 2.754072944 91.14522
## Dim.8 0.503178240 2.012712959 93.15793
## Dim.9 0.411065663 1.644262652 94.80220
## Dim.10 0.336691388 1.346765554 96.14896
## Dim.11 0.254731126 1.018924506 97.16789
## Dim.12 0.172594714 0.690378855 97.85826
## Dim.13 0.138243331 0.552973325 98.41124
## Dim.14 0.105400519 0.421602078 98.83284
## Dim.15 0.067346066 0.269384264 99.10222
## Dim.16 0.054954619 0.219818475 99.32204
## Dim.17 0.046558386 0.186233542 99.50828
## Dim.18 0.031588992 0.126355968 99.63463
## Dim.19 0.029601829 0.118407318 99.75304
## Dim.20 0.020191077 0.080764308 99.83380
## Dim.21 0.016267999 0.065071996 99.89888
## Dim.22 0.012623924 0.050495696 99.94937
## Dim.23 0.006886114 0.027544454 99.97692
## Dim.24 0.003535893 0.014143571 99.99106
## Dim.25 0.002235080 0.008940321 100.00000
We obtain the cumulative variance, as before, and also the eigenvalues.
We can therefore consider dimensions 1 to 5:
the cumulative variance reaches 84.57%,
and each of these dimensions has an eigenvalue greater than 1.
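Both selection criteria can be computed directly from a prcomp fit; a sketch on simulated standardized data:

```r
# Sketch: component-selection criteria computed from a prcomp fit
# (eigenvalues are the squared standard deviations of the components)
set.seed(7)
X <- matrix(rnorm(200 * 6), ncol = 6)
pca <- prcomp(X, scale. = TRUE)

eigenvalues <- pca$sdev^2
cum_var     <- cumsum(eigenvalues) / sum(eigenvalues)

which(eigenvalues > 1)     # Kaiser criterion: keep components with eigenvalue > 1
which(cum_var >= 0.80)[1]  # first component reaching 80% cumulative variance
```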
The variable factor map shows the variables organized along the
dimensions; here the first two dimensions are represented.
Dimension 1 (x-axis) is highly correlated to Phos_T, Phos_B, Phos_P and
Glu_T, moderately correlated to PT_B,
and poorly correlated to Cond_T and Cond_B.
Dimension 2 is well correlated to Cond_T and Cond_B,
and also moderately negatively correlated to Radiation.
It seems that we
have 4 groups of variables playing different roles. On these two
dimensions we notice that the mountain classes already separate into 3
distinct clusters.

- Dim 1: highly correlated to the PHOS, SOC and GLU variables
- Dim 2: correlated with Cond_P and Cond_T
- Dim 3: correlated with PT_P, PT_B and PT_T
- Dim 4: moderately correlated to K_B and K_T
- Dim 5: correlated to Radiation
The square cosine (cos2) shows the importance of a component for a given observation; it is therefore normal that observations close to the origin are less significant than those far from it. Here we decided to represent only one variable of each type, since the same chemical elements tend to behave the same way independently of their sampling method. A variable with an interesting behavior is Radiation: the higher the dimensions we select, the more important this variable becomes (except for dimension 4), while the variables related to chemical elements tend to decrease. Thus, we find Radiation strongly correlated with dimension 5.
As seen in the EDA, we can consider 5 dimensions. In the following graph we project the 3 mountains onto the first 3 dimensions; clusters may be apparent.
# prcomp() cannot handle missing values, so keep only complete rows
X <- Mountain_data.num %>% na.omit()

# PCA keeping only the first 3 principal components
prin_comp <- prcomp(X, rank. = 3)
components <- data.frame(prin_comp[["x"]])

# Flip the sign of PC2 and PC3 for a more readable orientation
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3

# Attach the mountain range labels to colour the points
components$Mountain_range <- total$`Mountain_data_cleaned$Mountain_range`

tit <- "Total Explained Variance = 74.03%"
fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3,
               color = ~Mountain_range,
               colors = c("#636EFA", "#EF553B", "#00CC96")) %>%
  add_markers(size = 12) %>%
  layout(title = tit, scene = list(bgcolor = "#e5ecf6"))
fig
Through this 3D plot, we can observe the distribution of the 3 mountains in the PCA space. 'Central Pyrenees' (red points) shows a high correlation to Dim 1. Further in the analysis we will perform a cluster analysis to better understand the apparent separation between the mountains.